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Manfred Glesner, Thomas Hollstein, Leandro Soares Indrusiak, Peter Zipf, Thilo Pionteck, Mihail Petrov, Heiko 
Zimmer, Tudor Murgan 

April 2004 Proceedings of the first conference on computing frontiers on Computing frontiers 


Ubiquitous computing requires flexibility. Melting distributed electronic devices into everyday life implies the 
need to adapt to evolving standards and dynamic environments. Furthermore, to gain user acceptance, such 
devices should be able to adapt to different usage patterns and user profiles. Scalability is also an important 
issue, allowing functional enhancements to already deployed systems. In this work we address these issues 
by applying the concept of reconfigurability on different abstract ... 

Keywords: communication, dynamic power management, networks-on-chip, reconfigurable hardware, 
reconfigurable processors, reconfiguration, ubiquitous computing 
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



Eclipse defines a heterogeneous multiprocessor architecture template for data-dependent stream processing. 
Intended as a scalable and flexible subsystem of forthcoming media-processing systems-on-a-chip, Eclipse 
combines application configuration flexibility with the efficiency of function-specific hardware, or 
coprocessors. To facilitate reuse, Eclipse separates coprocessor functionality from generic support that 
addresses multi-tasking, inter-task synchronization, and data transport. Fi ... 

3 Hardware 

Asawaree Kalavade, P. A. Subrahmanyam 

November 1997 Proceedings of the 1997 IEEE/ACM international conference on Computer-aided design 


We are interested in optimizing the design of multi-function embedded systems that run a pre-specified set 
of applications, such as multi-standard audio/video codecs and multi-system phones. Such systems usually 
have stringent performance constraints and tend to have mixed hardware-software implementations. The 
current state of the art in the hardware/software codesign of such systems is to design for each application 
separately. This often leads to application-specific sub-optimal decisions and ... 

Keywords: multi-function systems, hardware-software codesign, hardware/software partitioning, 
system-level design, core-based design, video encode/decode 



4 QedlcMed.CircMi 

Timo Vogt, Norbert Wehn, Philippe Alves 

September 2004 Proceedings of the 17th symposium on Integrated circuits and system design 
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In this paper, a VLSI implementation of a multi-standard channel-decoder for EDGE, WCDMA, and CDMA2k 
convolutional-codes is presented. The new architecture employs the MAP algorithm for convolutional 
decoding to support soft-outputs. The decoder is designed for base-station applications. The maximum 
throughput of the decoder is 16 Mbps for WCDMA and CDMA2k, and 70 Mbps for EDGE, at a clock frequency 
of 200 MHz. 

Keywords: CDMA2k, MAP, W-CDMA, configurable, convolutional decoder, hardware sharing, wireless 
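
As a rough check on the figures quoted in the abstract above, the following sketch converts the stated throughputs and clock frequency into the number of clock cycles available per decoded bit. The helper name `cycles_per_bit` is hypothetical and not taken from the paper, and the calculation assumes the throughput figures refer to decoded bits per second.

```python
def cycles_per_bit(clock_hz: float, throughput_bps: float) -> float:
    """Return how many clock cycles elapse per decoded bit.

    Hypothetical helper for illustration only; it assumes the quoted
    throughput is measured in decoded bits per second.
    """
    if clock_hz <= 0 or throughput_bps <= 0:
        raise ValueError("clock and throughput must be positive")
    return clock_hz / throughput_bps


if __name__ == "__main__":
    clock_hz = 200e6  # 200 MHz clock, as stated in the abstract
    # Maximum throughputs quoted in the abstract, in bits per second.
    modes = {"WCDMA/CDMA2k": 16e6, "EDGE": 70e6}
    for name, bps in modes.items():
        print(f"{name}: {cycles_per_bit(clock_hz, bps):.2f} cycles per bit")
```

Under these assumptions the WCDMA and CDMA2k modes leave several times more cycles per bit than the EDGE mode, which is consistent with different per-bit decoding effort across the supported standards.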

5 VLSI Design: On the high speed VLSI implementation of errors-and-erasures correcting Reed-Solomon 
decoders 

Tong Zhang, Keshab K. Parhi 

April 2002 Proceedings of the 12th ACM Great Lakes symposium on VLSI 




Recently a novel algorithm transformation was proposed to reduce the critical path of Berlekamp-Massey 
algorithm implementation for errors-alone Reed-Solomon decoding. In this paper, we apply the same 
methodology to transform the Berlekamp-Massey algorithm for errors-and-erasures RS decoding. We present 
a regular hardware architecture to implement the reformulated Berlekamp-Massey algorithm, which can 
achieve high throughput. Moreover, an operation scheduling scheme is proposed to further reduce ... 

Keywords: Berlekamp-Massey algorithm, Reed-Solomon codes, VLSI architectures, erasure 
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A. Worm, H. Lamm, N. Wehn 

March 2001 Proceedings of the conference on Design, automation and test in Europe 




7 



decoder 

Xun Liu, Marios C. Papaefthymiou 

June 2002 Proceedings of the 39th conference on Design automation 




The design of high-throughput large-state Viterbi decoders relies on the use of multiple arithmetic units. The 
global communication channels among these parallel processors often consist of long interconnect wires, 
resulting in large area and high power consumption. In this paper, we propose a data-transfer oriented 
design methodology to implement a low-power 256-state rate-1/3 IS95 Viterbi decoder. Our architectural 
level scheme uses operation partitioning, packing, and scheduling to analyze an ... 



Keywords: bus reduction, communications, pipelining 
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F. Viglione, G. Masera, G. Piccinini, M. Ruo Roch, M. Zamboni 

January 2000 Proceedings of the conference on Design, automation and test in Europe 




9 Case 

R. Peset Llopis, R. Sethuraman, C. Alba Pinto, H. Peters, S. Maul, M. Oosterhuis 

October 2003 Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software 
codesign and system synthesis 


Video encoders are an important IP block in mobile multimedia systems. In this paper, we describe a 
low-cost low-power multi-standard (MPEG4, JPEG, and H.263) video/image encoder. The low-cost and low-power 
aspects are achieved by the right choice of algorithms and architectures. On the algorithm front, an 
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embedded compression technique for reducing the size of loop memory has enabled a single-chip low-cost 
realization of the encoder. On the architectural front, an efficient hardware-software pa ... 

Keywords: ASIPs, hardware/software partitioning, low-cost, low-power, multi-standard, video encoder 

10 Architecture Implementation Using the Machine Description Language LISA 
Oliver Schliebusch, Andreas Hoffmann, Achim Nohl, Gunnar Braun, Heinrich Meyr 

January 2002 Proceedings of the 2002 conference on Asia South Pacific design automation/VLSI 
Design 


The development of application specific instruction set processors comprises several design phases: 
architecture exploration, software tools design, system verification and design implementation. The LISA 
processor design platform LPDP based on machine descriptions in the LISA language provides one common 
environment for these design phases. Required software tools for architecture exploration and application 
development can be generated from one sole specification. This paper focuses on the imp ... 

Keywords: ASIP, Exploration, Implementation, Design, Synthesis, VHDL, Verilog, SystemC, LISA 
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Register file and memory system design: Dynamic addressing memory arrays with physical locality 
Steven Hsu, Shin-Lien Lu, Shih-Chang Lai, Ram Krishnamurthy, Konrad Lai 

November 2002 Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture 




As pipeline width and depth grow to improve performance, memory arrays in microprocessors are growing in 
entries and ports. Arrays will increase in physical size, which prolongs the access time due to wiring delay. In 
order to boost clock frequency, these memory arrays must take multiple cycles to complete an access. This 
delays the scheduling of dependent instructions and affects overall performance. This paper proposes a 
different circuit organization to enable fast and slow accesses solely de ... 

Stefanos Kaxiras, Girija Narlikar, Alan D. Berenbaum, Zhigang Hu 

November 2001 Proceedings of the international conference on Compilers, architecture, and synthesis for 
embedded systems 


In the DSP world, many media workloads have to perform a specific amount of work in a specific period of 
time. This observation led us to examine Simultaneous Multithreading (SMT) and Chip Multiprocessing (CMP) 
for a VLIW DSP architecture (specifically the Star*Core SC140), in conjunction with Frequency/Voltage 
scaling to decrease dynamic power consumption in next-generation wireless handsets. We study the 
resulting performance and power characteristics of the two approaches using simulation, co ... 
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Manolis Katevenis, Panagiota Vatsolaki, Aristides Efthymiou 

October 1995 ACM SIGCOMM Computer Communication Review, Proceedings of the conference on 
Applications, technologies, architectures, and protocols for computer communication, 
Volume 25 Issue 4 




Switch chips are building blocks for computer and communication systems. Switches need internal buffering, 
because of output contention; shared buffering is known to perform better than multiple input queues or 
buffers, and the VLSI implementation of the former is not more expensive than the latter. We present a new 
organization for a shared buffer with its associated switching and cut-through functions. It is simpler and 
smaller than wide or interleaved organizations, and it is particula ... 

Keywords: crossbar switch, gigabit VLSI switch buffer, input queueing, multiport buffer, pipelined memory, 
shared buffering 
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T. Okuma, H. Tomiyama, A. Inoue, E. Fajar, H. Yasuura 

December 1998 Proceedings of the 11th international symposium on System synthesis 

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15 Compiler-driven cached code compression schemes for embedded ILP processors 
Sergei Y. Larin, Thomas M. Conte 

November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture 


During the last 15 years, embedded systems have grown in complexity and performance to rival desktop 
systems. The architectures of these systems present unique challenges to processor microarchitecture, 
including instruction encoding and instruction fetch processes. This paper presents new techniques for 
reducing embedded system code size without reducing functionality. This approach is to extract the pipeline 
decoder logic for an embedded VLIW processor in software at system develo ... 

Jessica H. Tseng, Krste Asanovic 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th annual 
international symposium on Computer architecture, Volume 31 Issue 2 


Multiported register files are a critical component of high-performance superscalar microprocessors. 
Conventional multiported structures can consume significant power and die area. We examine the designs of 
banked multiported register files that employ multiple interleaved banks of fewer ported register cells to 
reduce power and area. Banked register file designs have been shown to provide sufficient bandwidth for a 
superscalar machine, but previous designs had complex control structures that w ... 
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A.DISE jn)i>iem^ 

Marc L. Corliss, E. Christopher Lewis, Amir Roth 

June 2003 ACM SIGPLAN Notices, Proceedings of the 2003 ACM SIGPLAN conference on Language, 
compiler, and tool for embedded systems, Volume 38 Issue 7 


Code compression coupled with dynamic decompression is an important technique for both embedded and 
general-purpose microprocessors. Post-fetch decompression, in which decompression is performed after the 
compressed instructions have been fetched, allows the instruction cache to store compressed code but 
requires a highly efficient decompression implementation. We propose implementing post-fetch 
decompression using dynamic instruction stream editing (DISE), a programmable decoder-- ... 

Keywords: DISE, code compression, code decompression 
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John Teifel, Rajit Manohar 

February 2004 Proceeding of the 2004 ACM/SIGDA 12th international symposium on Field 
programmable gate arrays 




We present the design of a high-performance, highly pipelined asynchronous FPGA. We describe a very fine- 
grain pipelined logic block and routing interconnect architecture, and show how asynchronous logic can 
efficiently take advantage of this large amount of pipelining. Our FPGA, which does not use a clock to 
sequence computations, automatically "self-pipelines" its logic without the designer needing to be explicitly 
aware of all pipelining details. This property makes our FPGA ideal for throughp ... 

Keywords: asynchronous circuits, concurrency, correctness by construction, pipelining, programmable logic 
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Allowing for ILP in an embedded Java processor 
Ramesh Radhakrishnan, Deependra Talla, Lizy Kurian John 

May 2000 ACM SIGARCH Computer Architecture News, Proceedings of the 27th annual 
international symposium on Computer architecture, Volume 28 Issue 2 
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

Java processors are ideal for embedded and network computing applications such as Internet TVs, set-top 
boxes, smart phones, and other consumer electronics applications. In this paper, we investigate cost- 
effective microarchitectural techniques to exploit parallelism in Java bytecode streams. Firstly, we propose 
the use of a fill unit that stores decoded bytecodes into a decoded bytecode cache. This mechanism improves 
the fetch and decode bandwidth of Java processors by 2 to 3 time ... 

Ingrid Verbauwhede, Chris Nicol 

August 2000 Proceedings of the 2000 international symposium on Low power electronics and design 




Wireless communications and, more specifically, the fast-growing penetration of cellular phones and cellular 
infrastructure are the major drivers for the development of new programmable Digital Signal Processors 
(DSPs). In this tutorial, an overview will be given of recent developments in DSP processor architectures 
that make them well suited to execute computationally intensive algorithms typically found in 
communications systems. DSP processors have adapted instruction sets, memory archi ... 

Keywords: architectures, digital signal processing, programmable processors, wireless communications 
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21 A_32 : bjlCMOS.nicxo^ 

H. Kaneko, Y. Miki, S. Nohara, K, Koya, M. Araki 

November 1999 Proceedings of 1986 ACM Fall joint computer conference 

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Circuits: complexity analysis and an efficient optimal algorithm 
Sangyun Kim, Peter A. Beerel 

November 2000 Proceedings of the 2000 IEEE/ACM international conference on Computer-aided design 




This paper addresses the problem of identifying the minimal pipelining needed in an asynchronous circuit 
(e.g., number/size of pipeline stages/latches required) to satisfy a given performance constraint, thereby 
implicitly minimizing area and power for a given performance. In contrast to the somewhat analogous 
problem of retiming in the synchronous domain, we first show that the basic pipeline optimization problem 
for asynchronous circuits is NP-complete. This paper then presents an effic ... 

23 SessigiL§J.^ 

Matthew Moe, Herman Schmit 

April 2003 Proceedings of the 2003 international symposium on Physical design 

Floorplanning individual pipelined array modules of a larger overall die can yield beneficial results. Critical 
paths in every pipeline stage of a pipelined design are roughly equivalent after synthesis. The inability of 
synthesis tools to predict without full placement both wire congestion and the distance traveled by a wire or 
wires between consecutive registers are the greatest causes of additional delay and area during place and 
route. This paper will detail a floorplanning methodology for p ... 

Keywords: floorplan, pipelined array, sequence pair 
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Hue-Sung Kim, Arun K. Somani, Akhilesh Tyagi 

February 2000 Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field 
programmable gate arrays 
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A considerable portion of a chip is dedicated to a cache memory in a modern microprocessor chip. However, 
some applications may not actively need all the cache storage, especially the computing bandwidth limited 
applications. Instead, such applications may be able to use some additional computing resources. If the 
unused portion of the cache could serve these computation needs, the on-chip resources would be utilized 
more efficiently. This presents an opportunity to explore the reconfigurat ... 
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Increasing nonrecurring engineering and mask costs are making it harder to turn to hardwired application 
specific integrated circuit (ASIC) solutions for high-performance applications. The volume required to 
amortize these high costs has been increasing, making it increasingly expensive to afford ASIC solutions for 
medium-volume products. This has led to designers seeking programmable solutions of varying sorts using 
these so-called programmable platforms. These programmable platforms span a lar ... 

Keywords: Loop pipelining, coarse-grain reconfigurable fabric, datapath synthesis, interconnection design, 
reconfigurable datapath 
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This paper describes a new implementation of the ST20-C2 CPU architecture. The design involves an eight- 
stage pipeline with hardware support to execute up to three instructions in a cycle. Branch prediction is 
based on a 2-bit predictor scheme with a 1024-entry Branch History Table, a 64-entry Branch Target 
Buffer, and a 4-entry Return Stack. The implementation of all blocks in the processor was based on 
synthesized logic generation and automatic place and route. The full design of the CPU fro ... 

Keywords: ASIC, CPU, branch-prediction, cache, embedded, high-frequency, microarchitecture, pipeline, 
st20, synthesis 
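
The abstract above mentions a 2-bit predictor scheme with a 1024-entry Branch History Table. As a generic illustration of that kind of scheme, and not a description of the ST20-C2 design itself, the sketch below models a table of 2-bit saturating counters. The class name `TwoBitPredictor`, the indexing by the low bits of the branch address, and the initial counter state are all assumptions made for this example.

```python
class TwoBitPredictor:
    """Generic table of 2-bit saturating counters (illustrative only).

    Counter values 0 and 1 predict not-taken; 2 and 3 predict taken.
    Indexing by pc modulo the table size and starting every counter at 1
    (weakly not-taken) are assumptions, not details from the abstract.
    """

    def __init__(self, entries: int = 1024) -> None:
        self.entries = entries
        self.counters = [1] * entries

    def _index(self, pc: int) -> int:
        # Assumed indexing: low-order bits of the branch address.
        return pc % self.entries

    def predict(self, pc: int) -> bool:
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    predictor = TwoBitPredictor()
    pc = 0x40
    # A loop branch that is taken three times and then falls through once.
    for outcome in (True, True, True, False):
        print(f"predicted taken={predictor.predict(pc)}, actual taken={outcome}")
        predictor.update(pc, outcome)
```

The point of using two bits rather than one is hysteresis: a single mispredicted loop exit moves a strongly-taken counter only to weakly-taken, so the next execution of the loop is still predicted taken.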
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Two major limitations concerning the design of cost-effective application-specific architectures are the 
recurrent costs of system-software development and hardware implementation, in particular VLSI 
implementation, for each architecture. The scalable ARChitecture Experiment (SCARCE) aims to provide a 
framework for application-specific processor design. The framework allows scaling of functionality, 
implementation complexity, and performance. The SCARCE framework consists and wil ... 
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With the emergence of a plethora of embedded and portable applications, energy dissipation has joined 
throughput, area, and accuracy/precision as a major design constraint. Thus, designers must be concerned 
with both optimizing and estimating the energy consumption of circuits, architectures, and software. Most of 
the research in energy optimization and/or estimation has focused on single components of the system and 
has not looked across the interacting spectrum of the hardware and softwar ... 
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This paper proposes a new bipartition-codec architecture that may reduce power consumption of pipelined 
circuits. We treat each output value of a pipelined circuit as one state of an FSM. If the output of a pipelined 
circuit transits mainly among a few states, we could partition the combinational portion of a pipelined circuit into 
two blocks: one that contains the few states of high activity is small and the other that contains the 
remainder of low activity is big. Consequently, the state trans ... 
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Application-specific instructions can significantly improve the performance, energy, and code size of 
configurable processors. A common approach used in the design of such instructions is to convert 
application-specific operation patterns into new complex instructions. However, processors with a fixed 
instruction bitwidth cannot accommodate all the potentially interesting operation patterns, due to the limited 
code space afforded by the fixed instruction bitwidth. We present a novel instruction ... 
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Typically, good automated ASIC designs may be two to five times slower than handcrafted custom designs. 
At last year's DAC this was examined and causes of the speed gap between custom circuits and ASICs were 
identified. In particular, faster custom speeds are achieved by a combination of factors: good architecture 
with well-balanced pipelines; compact logic design; timing overhead minimization; careful floorplanning, 
partitioning and placement; dynamic logic; post-layout transistor and wire ... 
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Superscalar processors obtain their performance by exploiting instruction level parallelism in programs. Their 
performance is therefore limited by characteristics of programs and the design of the processor. Due to the 
complexity involved, estimating the performance of any superscalar processor design is a difficult task. Quick 
prediction of performance improvement arising from architecture modifications is even more difficult. In this 
paper, a model of superscalar processors using a network of ... 
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The need to perform early design studies that combine architectural simulation with power estimation has 
become critical as power has become a design constraint whose importance has moved to the fore. To satisfy 
this demand several microarchitectural power simulators have been developed around SimpleScalar, a widely 
used microarchitectural performance simulator. They have proven to be very useful at providing insights into 
power/performance trade-offs. However, they are neither parameterized nor ... 
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Dynamic Instruction Stream Editing (DISE) is a cooperative software-hardware scheme for efficiently 
adding customization functionality (e.g., safety/security checking, profiling, dynamic code decompression, 
and dynamic optimization) to an application. In DISE, application customization functions (ACFs) are 
formulated as rules for macro-expanding certain instructions into parameterized instruction sequences. The 
processor executes the rules on the fetched instructions, feeding the executi ... 
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The doubling of microprocessor performance every three years has been the result of two factors: more 
transistors per chip and superlinear scaling of the processor clock with technology generation. Our results 
show that, due to both diminishing improvements in clock rates and poor wire scaling as semiconductor 
devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. 
In this paper, we describe technology-driven models for wire cap ... 
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The development of application specific instruction set processors (ASIP) is currently the exclusive domain of 
the semiconductor houses and core vendors. This is due to the fact that building such an architecture is a 
difficult task that requires expert knowledge in different domains: application software development tools, 
processor hardware implementation, and system integration and verification. This paper presents a 
retargetable framework for ASIP design which is based on machine descript ... 
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As the issue width of superscalar processors is increased, instruction fetch bandwidth requirements will also 
increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches 
hinder this effort because long instruction sequences are not always in contiguous cache locations. We 
propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of 
the dynamic instruction stream, so instructions that are otherwise no ... 
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